Homework 4
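Please upload the HTML file to the Ceiba assignment area. When answering, we recommend the "sandwich" approach: first explain what you are going to do, then list the code and results, and finally explain what the results mean. Do the homework yourself. Plagiarism is strictly prohibited. Paper submissions and late submissions are not accepted. Please answer in English or Chinese.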
Data visualization is a useful technique that is often used to understand the characteristics of a dataset. We are going to practice this skill using a university department offer of admission dataset.
A large portion of high school students are admitted to universities through an application and screening process that requires each university department to offer admission to applicants before students can choose where they want to go. If we think of applicants as the customers of an academic department, then the overlap in offered applicants between departments can be used to understand the competition relationships between academic departments. We are going to visualize these competition relationships using the University Department Offer of Admission Dataset (UDOAD).
UDOAD was collected through a popular online offer-search service (https://freshman.tw/) for the 2017 academic year. We collected the offers received by each applicant as well as the basic information for academic departments. This dataset contains two files, which the code below loads.
import numpy as np
import pandas as pd
stu_adm = pd.read_csv('ds/student_admission106.csv', encoding="utf-8", dtype=str)
uname = pd.read_csv('ds/univ_name106short1.csv', encoding="utf-8", dtype=str)
all_depid = stu_adm['department_id'].unique()
all_stuid = stu_adm['student_id'].unique()
ndepid = all_depid.shape[0]
nstuid = all_stuid.shape[0]
print("There are %d students and %d departments in total." % (nstuid, ndepid))
print("offers received by students:")
stu_adm.head(20)
The department_id can uniquely identify an academic department. We do not care about the ranking of admission here, and you should just ignore the "state" column. We only care about the "co-application" relations in this dataset. You should use student_id to uniquely identify a student applicant.
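To make the idea of a "co-application" relation concrete, here is a minimal sketch on a made-up table with the same `student_id` and `department_id` columns. The IDs and the helper name `co_application_counts` are hypothetical and are not part of the dataset or the assignment.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy long-format table (made-up IDs) with the same two columns as stu_adm.
toy = pd.DataFrame({
    'student_id':    ['s1', 's1', 's2', 's2', 's2', 's3'],
    'department_id': ['d1', 'd2', 'd1', 'd2', 'd3', 'd3'],
})

def co_application_counts(df):
    """Count, for each unordered department pair, the students they share."""
    counts = Counter()
    for _, deps in df.groupby('student_id')['department_id']:
        # Sort so that (d1, d2) and (d2, d1) are counted as the same pair.
        for pair in combinations(sorted(set(deps)), 2):
            counts[pair] += 1
    return counts

pair_counts = co_application_counts(toy)
print(pair_counts)
```

Students with a single record, such as `s3` here, contribute no pairs, which is one reason the assignment later drops them.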
print("academic department basic information:")
uname.head(10)
You can use this dataset to look up the name that corresponds to a department_id. The school_name and department_name contain the "full name" of an academic department. To facilitate visualization, we also provide "shorter names" in school_name_abbr and department_name_abbr. The category_name is the field of an academic department. This field is very important in our visualization exercise since you should color each data point according to its category_name.
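As an illustration of how these columns might be used, the sketch below builds a short label and a color for each department from a made-up stand-in for `uname`. The rows, the `toy_uname` name, and the color choices are assumptions made for the example only.

```python
import pandas as pd

# Toy stand-in for uname (made-up rows); only the columns used here.
toy_uname = pd.DataFrame({
    'department_id': ['001', '002', '003'],
    'school_name_abbr': ['A', 'A', 'B'],
    'department_name_abbr': ['Math', 'Law', 'Math'],
    'category_name': ['Science', 'Social', 'Science'],
})

# Index by department_id so a label or category can be looked up directly.
info = toy_uname.set_index('department_id')
labels = info['school_name_abbr'] + ' ' + info['department_name_abbr']

# Assign one color per category; these matplotlib color names are an arbitrary choice.
palette = ['tab:blue', 'tab:orange', 'tab:green']
categories = sorted(info['category_name'].unique())
color_of = dict(zip(categories, palette))
point_colors = info['category_name'].map(color_of)

print(labels['003'])
print(point_colors.to_dict())
```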
(20%) Our focus is the relationships between departments. In order to study them, we need to convert the raw data into a "matrix" representation. Each row represents an academic department, and each column represents a student applicant. The value of a cell is 1 if the student applied for admission to the corresponding academic department, and 0 otherwise.
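One possible way to obtain such a matrix is `pd.crosstab`, sketched here on made-up IDs. This is only an illustration of the target layout, not the required solution.

```python
import pandas as pd

# Toy long-format records (made-up IDs): one row per (student, department).
toy = pd.DataFrame({
    'student_id':    ['s1', 's1', 's2', 's2', 's2', 's3'],
    'department_id': ['d1', 'd2', 'd1', 'd2', 'd3', 'd3'],
})

# crosstab counts (department, student) pairs; clip keeps cells at 0/1
# even if the same pair appears more than once in the raw records.
matrix = pd.crosstab(index=toy['department_id'],
                     columns=toy['student_id']).clip(upper=1)
print(matrix)
```

Row sums of `matrix` give the number of applications per department, and column sums give the number of departments per student.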
To avoid potential numerical problems, we only include an academic department if it received ten or more applications. Moreover, we only include a student applicant if he or she applied for more than one academic department. You need to make sure that both conditions are satisfied in your processed dataset.
Note that the two conditions should be satisfied "as is" in your final dataset. For example, if a student applied for two departments in the original dataset, and one of the departments was removed, then this student should be removed as well because the student applied for only one department in the processed dataset.
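Because removing a department can push a student below the threshold and vice versa, the two filters have to be repeated until nothing changes. The sketch below shows this on a tiny made-up matrix; the function name `filter_until_stable` and the small thresholds are assumptions chosen so that the cascade is visible.

```python
import pandas as pd

def filter_until_stable(matrix, min_apps, min_deps):
    """Drop sparse rows and columns repeatedly until nothing more is removed."""
    while True:
        before = matrix.shape
        matrix = matrix.loc[matrix.sum(axis=1) >= min_apps]     # departments
        matrix = matrix.loc[:, matrix.sum(axis=0) >= min_deps]  # students
        if matrix.shape == before:
            return matrix

# Rows are departments, columns are students (made-up data).
toy = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]],
    index=['d1', 'd2', 'd3'],
    columns=['s1', 's2', 's3', 's4'],
)

# Toy thresholds; the assignment itself asks for min_apps=10 and min_deps=2.
stable = filter_until_stable(toy, min_apps=2, min_deps=2)
print(stable)
```

In this toy case `s4` is dropped first, which leaves `d3` with one application, and dropping `d3` in turn leaves `s3` with one department, so a single pass would not be enough.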
Answer the following question:
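# Rows are departments, columns are students; a cell is 1 if the student has a record for that department.
student_id = pd.get_dummies(stu_adm['student_id'], dtype='uint8').groupby(stu_adm['department_id']).max()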
# a = pd.crosstab(index=stu_adm['department_id'], columns = stu_adm['student_id'])
# group_std = stu_adm.groupby('department_id')
# table = pd.DataFrame()
# count = 0
# for i in all_depid:
# if count %100 == 0:
# print(count)
# count += 1
# group = group_std.get_group(i)
# std_dummy = pd.DataFrame(data = np.array([[ 1 for j in range(len(group['student_id']))]]),index = [i], columns = group['student_id'])
# table = pd.concat([table, std_dummy], sort = False)
# table = table.replace(np.nan, 0)
def row_clean(data):
    # Keep departments (rows) that received ten or more applications.
    cleaned_data = data[data.sum(axis = 1) >= 10]
    if cleaned_data.shape == data.shape:
        return cleaned_data, False
    return cleaned_data, True

def column_clean(data):
    # Keep students (columns) who applied for more than one department.
    cleaned_data = data.loc[:, data.sum(axis = 0) > 1]
    if cleaned_data.shape == data.shape:
        return cleaned_data, False
    return cleaned_data, True
revised_table = student_id.copy()
print('Before data cleaning: ','department number: ', revised_table.shape[0], ', student applicant number: ', revised_table.shape[1])
flag_row = True
flag_col = True
check = True
# Repeat both filters until neither removes anything.
while check:
    revised_table, flag_row = row_clean(revised_table)
    revised_table, flag_col = column_clean(revised_table)
    check = flag_row or flag_col
print('After data cleaning: ','department number: ', revised_table.shape[0], ', student applicant number: ', revised_table.shape[1])
top_ten = revised_table.sum(axis = 1).sort_values(ascending=False).head(10)
df = pd.DataFrame({'sum':top_ten.values, 'department_id':top_ten.index})
top_ten_info = uname[ uname['department_id'].isin( top_ten.index)]
top_ten_info = top_ten_info.reset_index(drop = True)
res = top_ten_info.loc[:,['department_id', 'school_name','department_name']]
res = pd.merge(res, df, on=['department_id']).sort_values(by = 'sum',ascending=False)
res = res.reset_index(drop = True)
print(res)
(50%) Visualize academic departments in the following questions. In all plots, you should color data points according to the academic department's category. Moreover, you should provide a legend or a picture that illustrates the mapping between colors and category names. Visualize the data using two-dimensional plots. Note that it is your responsibility to study the documentation of the libraries of your choice and make sure that the results are reasonable.
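The coloring and legend requirement can be sketched with plain matplotlib as follows. The coordinates and category names are made up and stand in for the output of whichever dimensionality reduction method is used; the `Agg` backend line is only there so the sketch runs without a display and can be removed in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up 2-D coordinates standing in for the output of any reduction method.
rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 2))
categories = np.array(['Science', 'Social', 'Science', 'Arts', 'Social', 'Arts'])

fig, ax = plt.subplots(figsize=(6, 6))
# One scatter call per category gives each category its own color and legend entry.
for name in sorted(set(categories)):
    mask = categories == name
    ax.scatter(coords[mask, 0], coords[mask, 1], label=str(name))
ax.legend(title='category_name')

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
plt.close(fig)
```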
def plot(data, cols, x= None, y= None):
    # Draw a 2-D scatter plot of departments colored by category.
    # Relies on the globals depart_id, CATEGORY, cat_dict, and uname.
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese names
    plt.rcParams['axes.unicode_minus'] = False
    plt.rcParams['figure.figsize'] = 10, 10
    result = pd.DataFrame(data, index = depart_id, columns= cols)
    result['category_name'] = [0 for j in range(len(result))]
    for i in CATEGORY:
        ### append category name to data
        name = uname.loc[uname['category_id']== i]['category_name'].unique()[0]
        result.loc[ result.index.isin(cat_dict[i]) , ['category_name']] = name
    ax = sns.scatterplot(x = x, y= y, data = result, hue = 'category_name')
    plt.show()
    return ax
CATEGORY = ['1', '2', '3', '4', '5', '6', '7', '9', '99', '8', '10']
cat_dict = dict()
for i in CATEGORY:
    cat_dict[i] = (uname.loc[uname['category_id'] == i])['department_id'].values
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.decomposition import PCA
input_data = revised_table.copy()
depart_id = list(input_data.index)
pca = PCA(n_components = 8)
transform = pca.fit_transform(input_data)
# Plot every pair of the eight principal components (p1 to p8).
for i in range(1, 8):
    for j in range(i+1, 9):
        plot(transform, ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'], 'p'+ str(i), 'p'+ str(j))
plot(transform, ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'], 'p4', 'p7')
from sklearn.manifold import MDS
input_data = revised_table.copy()
mds = MDS(n_components = 2, metric = True)
transform = mds.fit_transform(input_data)
plot(transform, ['x', 'y'], 'x' , 'y')
from sklearn.manifold import MDS
input_data = revised_table.copy()
mds = MDS(n_components = 2, metric = False)
transform = mds.fit_transform(input_data)
plot(transform, ['x', 'y'], 'x' , 'y')
from sklearn.manifold import LocallyLinearEmbedding
input_data = revised_table.copy()
lle = LocallyLinearEmbedding(n_components=2, n_neighbors = 20)
transform = lle.fit_transform(input_data)
plot(transform, ['x', 'y'], 'x' , 'y')
from sklearn.manifold import LocallyLinearEmbedding
input_data = revised_table.copy()
lle = LocallyLinearEmbedding(n_components=2, n_neighbors = 40)
transform = lle.fit_transform(input_data)
plot(transform, ['x', 'y'], 'x' , 'y')
pca = PCA(n_components = 100)
input_data = revised_table.copy()
transform = pca.fit_transform(input_data)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors = 20)
transform_final = lle.fit_transform(transform)
plot(transform_final, ['x', 'y'], 'x' , 'y')
from sklearn.decomposition import KernelPCA
input_data = revised_table.copy()
kernel_pca = KernelPCA(n_components = 8, kernel='rbf')
transform = kernel_pca.fit_transform(input_data)
for i in range(1, 8):
    for j in range(i+1, 9):
        plot(transform, ['p1','p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'], 'p'+ str(i) , 'p' + str(j))
plot(transform, ['p1','p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'], 'p4', 'p7')
from sklearn.decomposition import KernelPCA
input_data = revised_table.copy()
kernel_pca = KernelPCA(n_components = 8, kernel='cosine')
transform = kernel_pca.fit_transform(input_data)
for i in range(1, 8):
    for j in range(i+1, 9):
        plot(transform, ['p1','p2', 'p3', 'p4', 'p5', 'p6','p7', 'p8'], 'p'+ str(i) , 'p' + str(j))
plot(transform, ['p1','p2', 'p3', 'p4', 'p5', 'p6','p7', 'p8'], 'p3', 'p4' )
from sklearn.manifold import TSNE
input_data = revised_table.copy()
transform = TSNE(n_components=2).fit_transform(input_data)
plot(transform, ['x','y'],'x','y')
from sklearn.manifold import TSNE
input_data = revised_table.copy()
transform = TSNE(n_components=2,metric = 'cosine').fit_transform(input_data)
final_plot = plot(transform, ['x','y'], 'x', 'y')
from sklearn.manifold import TSNE
input_data = revised_table.copy()
transform = TSNE(n_components=2, metric='jaccard').fit_transform(input_data)
plot(transform, ['x','y'], 'x', 'y')
(30%) Select the most promising visualization method from the previous question and refine the result. You should color points by department category. Label each data point with its name so that we can easily identify a data point in the picture. Moreover, you should try to reduce the problems caused by overlapping points and labels. Output a picture that is large enough for a user to easily identify a department and its neighbors. Jupyter Notebook limits the maximum picture size. To overcome this problem, consider outputting the picture to a separate file and submitting the file for grading. Your score depends on how useful, readable, and visually pleasing your visualization results are.
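A minimal sketch of writing a large, labeled figure to a separate file is shown below. The coordinates, labels, file name, and figure size are all made up for illustration, and the label offset and marker transparency are only two simple ways to ease overlap, not a complete solution.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up coordinates and labels standing in for the embedded departments.
rng = np.random.default_rng(1)
coords = rng.normal(size=(5, 2))
names = ['dept A', 'dept B', 'dept C', 'dept D', 'dept E']

fig, ax = plt.subplots(figsize=(20, 20))
ax.scatter(coords[:, 0], coords[:, 1], s=40, alpha=0.6)  # alpha softens overlapping markers
for (x, y), name in zip(coords, names):
    # Offset each label by a few points so it does not sit on top of its marker.
    ax.annotate(name, (x, y), xytext=(4, 4), textcoords='offset points', fontsize=8)

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), 'labeled_plot_sketch.png')
fig.savefig(out_path, dpi=100, bbox_inches='tight')
plt.close(fig)
print(os.path.exists(out_path))
```

Raising `figsize` or `dpi` increases the pixel dimensions of the saved file independently of what the notebook can display inline.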
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns; sns.set()
from sklearn.decomposition import PCA, KernelPCA
input_data = revised_table.copy()
depart_id = list(input_data.index)
pca = PCA(n_components = 100)
transform = pca.fit_transform(input_data)
transform_final = TSNE(n_components = 2,metric = 'cosine').fit_transform(transform)
plot(transform_final, ['x','y'], 'x', 'y')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = 100, 100
fig = plt.figure()
result = pd.DataFrame(transform_final, index = depart_id, columns= ['x', 'y'])
result['category_name'] = [0 for j in range(len(result))]
result['department_name'] = ['' for j in range(len(result))]
for i in CATEGORY:
    ### append category name to data
    name = uname.loc[uname['category_id']== i]['category_name'].unique()[0]
    result.loc[ result.index.isin(cat_dict[i]) , ['category_name']] = name
for i in result.index:
    # Label each department with its school name followed by its department name.
    data = uname.loc[uname['department_id']== i]
    name = data['school_name'].values[0]+ data['department_name'].values[0]
    result.loc[result.index == i,'name'] = name
ax = sns.scatterplot(x = 'x', y= 'y', data = result, hue = 'category_name', s = 1000)
for i in range(len(result)):
    ax.annotate(result.iloc[i,:]['name'],(result.iloc[i,:]['x'],result.iloc[i,:]['y']))
fig.savefig('q3.png') # save the figure to file before showing it
plt.show()